
Conversation

@JJJYmmm (Contributor) commented Oct 26, 2025

This PR adds support for the Qwen3-VL series, including both the dense and MoE variants.
The original implementation was contributed by @yairpatch and @Thireus (see #16207). @LETS-BEE also helped address issues such as weight loading.

In this PR, I’ve fixed several algorithmic implementation details (e.g., deepstack), added support for MRoPE-Interleave, and performed final code cleanup.
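
For readers unfamiliar with MRoPE-Interleave, here is a toy sketch of the idea (illustrative only, not the ggml/llama.cpp kernel; the section sizes and the exact cyclic order are assumptions): standard MRoPE assigns the temporal/height/width position components to contiguous blocks of rotary pairs, while the interleaved variant cycles through the components across the pairs.

// Toy illustration only -- not the actual rope kernel. Section sizes and the
// cyclic order below are assumptions made for the example.
#include <cstdio>

enum component { T = 0, H = 1, W = 2 }; // temporal, height, width

// standard MRoPE: contiguous sections [0, s_t) -> T, [s_t, s_t + s_h) -> H, rest -> W
static component chunked(int pair, int s_t, int s_h) {
    if (pair < s_t)       return T;
    if (pair < s_t + s_h) return H;
    return W;
}

// interleaved MRoPE: cycle T, H, W, T, H, W, ... across the rotary pairs
static component interleaved(int pair) {
    return (component) (pair % 3);
}

int main() {
    const int n_pairs = 12, s_t = 4, s_h = 4; // assumed section sizes
    for (int i = 0; i < n_pairs; ++i) {
        std::printf("pair %2d  chunked=%d  interleaved=%d\n", i, chunked(i, s_t, s_h), interleaved(i));
    }
    return 0;
}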

JJJYmmm and others added 2 commits October 26, 2025 19:18
@github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs), examples, python (python script changes), and ggml (changes relating to the ggml tensor library for machine learning) labels on Oct 26, 2025
@taronaeo linked an issue on Oct 26, 2025 that may be closed by this pull request

@Thireus (Contributor) commented Oct 26, 2025

Thank you @JJJYmmm! Test builds:

https://github.com/Thireus/llama.cpp/releases - tagged tr-qwen3-vl-6

@ddh0 (Contributor) commented Oct 27, 2025

Thank you! Looking forward to this so we (myself and @rujialiu) can progress with #16600 :)

@xbl916 commented Oct 27, 2025

Thank you @JJJYmmm! Test builds:

https://github.com/Thireus/llama.cpp/releases/tag/tr-qwen3-vl-6-b7106-495c611

For some reason, this version's OCR capability is not as good as the previous LETS-BEE version; it noticeably misses characters and exhibits infinite repetition.

iosub added a commit to iosub/ollama that referenced this pull request Oct 27, 2025
Integrates Qwen3-VL and Qwen3VL-MoE architecture support from upstream.
Implements IMROPE (Interleaved Multi-resolution RoPE) for vision models.
Adds deepstack layer support for visual feature processing.

Changes include:
- New architecture types: LLM_ARCH_QWEN3VL, LLM_ARCH_QWEN3VLMOE
- IMROPE rope type for vision position encoding
- Deepstack visual feature handling in clip.cpp
- GGML CUDA kernels for IMROPE
- Tensor mappings for Qwen3VL architecture

Upstream PR: ggml-org/llama.cpp#16780
Contributors: @JJJYmmm @yairpatch @Thireus @LETS-BEE

@theo77186

The question is: are the fixes from #16745 included in this PR? If not, the full performance of the model will only be reached once #16745 is merged.

@psi00 commented Oct 27, 2025

Thank you @JJJYmmm! Test builds:

https://github.com/Thireus/llama.cpp/releases/tag/tr-qwen3-vl-6-b7106-495c611

I'm still getting an unknown model architecture error here?

ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from C:\Apps\llama.cpp\ggml-cuda.dll
load_backend: loaded RPC backend from C:\Apps\llama.cpp\ggml-rpc.dll
load_backend: loaded CPU backend from C:\Apps\llama.cpp\ggml-cpu-haswell.dll
build: 7106 (495c6115) with clang version 19.1.5 for x86_64-pc-windows-msvc
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3060) (0000:07:00.0) - 11240 MiB free
llama_model_loader: max stdio successfully set to 2048
llama_model_loader: loaded meta data with 21 key-value pairs and 399 tensors from C:\models\llama.cpp\Qwen3-VL-8B-Instruct.Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = Qwen3-VL-8B-Instruct
llama_model_loader: - kv   1:                                    version u32              = 3
llama_model_loader: - kv   2:                               tensor_count u32              = 399
llama_model_loader: - kv   3:                               general.type str              = model
llama_model_loader: - kv   4:                         general.size_label str              = 8B
llama_model_loader: - kv   5:                               bos_token_id u32              = 151643
llama_model_loader: - kv   6:                               eos_token_id u32              = 151645
llama_model_loader: - kv   7:                                 hidden_act str              = silu
llama_model_loader: - kv   8:                                hidden_size u32              = 4096
llama_model_loader: - kv   9:                          intermediate_size u32              = 12288
llama_model_loader: - kv  10:                    max_position_embeddings u32              = 262144
llama_model_loader: - kv  11:                        num_attention_heads u32              = 32
llama_model_loader: - kv  12:                          num_hidden_layers u32              = 36
llama_model_loader: - kv  13:                        num_key_value_heads u32              = 8
llama_model_loader: - kv  14:                               rms_norm_eps f32              = 0.000001
llama_model_loader: - kv  15:                                 rope_theta f32              = 5000000.000000
llama_model_loader: - kv  16:                             attention_bias bool             = false
llama_model_loader: - kv  17:                                   head_dim u32              = 128
llama_model_loader: - kv  18:                        tie_word_embeddings bool             = false
llama_model_loader: - kv  19:                                 vocab_size u32              = 151936
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  145 tensors
llama_model_loader: - type q4_0:  254 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_0 (guessed)
print_info: file size   = 4.29 GiB (4.50 BPW)
llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'Qwen3-VL-8B-Instruct'
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model 'C:\models\llama.cpp\Qwen3-VL-8B-Instruct.Q4_0.gguf', try reducing --n-gpu-layers if you're running out of VRAM
main: error: unable to load model

@i4TsU commented Oct 27, 2025

The question is: are the fixes from #16745 included in this PR? If not, the full performance of the model will only be reached once #16745 is merged.

They are not, as @FMayran and @rujialiu are still figuring out the best way to implement a proper fix once and for all :). You can cherry-pick the changes from #16745 without any problems and build it yourself as a temporary solution, though make sure to check the issues raised in the last 24-48 hours about why it's not yet a 100% fix.

@PaymonHossaini commented Oct 27, 2025

@psi00

I have managed to get Qwen3-VL-30B-A3B-Instruct running on Ubuntu just now (specifically with a Ryzen AI Max+ 395 and Vulkan). Did you create your own GGUF/mmproj.gguf using convert_hf_to_gguf.py?

Here is how I prepared mine:

huggingface-cli download Qwen/Qwen3-VL-30B-A3B-Instruct --local-dir tmp/Qwen3-VL-30B-A3B-Instruct --local-dir-use-symlinks False --include "*.json" "*.safetensors" "preprocessor_config.json"

# convert the model
CUDA_VISIBLE_DEVICES="" HF_HOME=~/projects/llamajjy/llama.cpp/tmp python3 convert_hf_to_gguf.py tmp/Qwen3-VL-30B-A3B-Instruct --outtype f16 --use-temp-file --outfile models

# convert the mmproj
CUDA_VISIBLE_DEVICES="" HF_HOME=~/projects/llamajjy/llama.cpp/tmp python3 convert_hf_to_gguf.py tmp/Qwen3-VL-30B-A3B-Instruct --outtype f16 --use-temp-file --outfile models --mmproj

# run llama.cpp
build-vulkan/bin/llama-server -m models/Qwen3-VL-30B-A3B-Instruct-F16.gguf --mmproj models/mmproj-Qwen3-VL-30B-A3B-Instruct-f16.gguf --jinja --host 0.0.0.0 --port 8081 -ngl 999

No GGUFs I found off the shelf were working right until I did this. Hope this helps.

@psi00 commented Oct 27, 2025

Thank you. I was using the GGUFs from NexaAI. May I add, though, that I think the architecture is different for each model (30B/8B/4B)? I will try this, thanks again.

Comment on the deepstack concatenation code:

    deepstack_features = feat;
} else {
    // concat along the feature dimension
    deepstack_features = ggml_concat(ctx0, deepstack_features, feat, 0);

Collaborator:

Not very important to optimize this right now, but doing ggml_concat across multiple layers can increase memory usage. One trick is to allocate one big tensor, then use ggml_set_rows to copy the intermediate results into the allocated tensor.

cc @ggerganov, do you think this would be a good approach for concatenating multiple tensors?
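
To make the suggestion concrete, a minimal sketch of the pre-allocation idea (names and shapes are assumptions; it uses ggml_view_2d + ggml_cpy to place each layer's slice, while the ggml_set_rows variant mentioned above would do a row-wise copy instead):

#include "ggml.h"

// Sketch only: allocate the concatenated deepstack buffer once and copy each
// layer's features into a view of it, instead of growing it with ggml_concat.
// Assumes every feats[l] is F32 with shape [n_embd, n_tokens].
static ggml_tensor * build_deepstack_buffer(ggml_context * ctx0,
                                            ggml_cgraph  * gf,
                                            ggml_tensor ** feats,
                                            int            n_deepstack,
                                            int64_t        n_embd,
                                            int64_t        n_tokens) {
    // one big destination: [n_deepstack * n_embd, n_tokens]
    ggml_tensor * dst = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, n_deepstack * n_embd, n_tokens);

    for (int l = 0; l < n_deepstack; ++l) {
        // view of the slice this layer should fill (offset along dim 0)
        ggml_tensor * slot = ggml_view_2d(ctx0, dst, n_embd, n_tokens,
                dst->nb[1],                                    // row stride of the big tensor
                (size_t) l * n_embd * ggml_element_size(dst)); // byte offset within each row
        // copy the layer's features into the pre-allocated slot; expanding the
        // copy node into the graph keeps it from being pruned
        ggml_build_forward_expand(gf, ggml_cpy(ctx0, feats[l], slot));
    }
    return dst;
}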

Contributor Author:

Oh, I just followed the style of llava:

llama.cpp/tools/mtmd/clip.cpp

Lines 1278 to 1285 in 1c1409e

    // If feature layers are explicitly set, stack them (if we have multiple)
    if (!embedding_stack.empty()) {
        embeddings = embedding_stack[0];
        for (size_t i = 1; i < embedding_stack.size(); i++) {
            embeddings = ggml_concat(ctx0, embeddings, embedding_stack[i], 0);
        }
    }
}

Collaborator:

Yes, but llava has a fixed number of tokens (no dynamic resolution), so the memory usage is predictable.

Contributor Author:

Got it, I’ll optimize it later. 🫡

Contributor Author:

done!

Member:

Add a TODO comment with a reference to this thread to not forget to improve this later.

Collaborator (@ngxson, Oct 30, 2025):

Suggested change:

-            deepstack_features = ggml_concat(ctx0, deepstack_features, feat, 0);
+            // TODO: pre-allocate memory and use ggml_set_rows, see: https://github.com/ggml-org/llama.cpp/pull/16780/files#r2465886647
+            deepstack_features = ggml_concat(ctx0, deepstack_features, feat, 0);

@psi00 commented Oct 27, 2025

@PaymonHossaini,
I get another architecture error when trying to quantize:

python .\Qwen3-VL-8B-Instruct\llama.cpp\convert_hf_to_gguf.py --outtype f16 .\Qwen3-VL-8B-Instruct\ --use-temp-file --outfile models
INFO:hf-to-gguf:Loading model: Qwen3-VL-8B-Instruct
INFO:hf-to-gguf:Model architecture: Qwen3VLForConditionalGeneration
ERROR:hf-to-gguf:Model Qwen3VLForConditionalGeneration is not supported

@PaymonHossaini commented Oct 27, 2025

@PaymonHossaini, I get another architecture error when trying to quantize:

python .\Qwen3-VL-8B-Instruct\llama.cpp\convert_hf_to_gguf.py --outtype f16 .\Qwen3-VL-8B-Instruct\ --use-temp-file --outfile models
INFO:hf-to-gguf:Loading model: Qwen3-VL-8B-Instruct
INFO:hf-to-gguf:Model architecture: Qwen3VLForConditionalGeneration
ERROR:hf-to-gguf:Model Qwen3VLForConditionalGeneration is not supported
[image]

While it's true that the 30B is MoE and the 8B is dense, I was unable to reproduce this issue. Make sure your local checkout tracks the PR branch, as there were some changes to that script to make it compatible with these models.

My instructions for the 8B model are below:

huggingface-cli download Qwen/Qwen3-VL-8B-Instruct --local-dir tmp/Qwen3-VL-8B-Instruct --local-dir-use-symlinks False --include "*.json" "*.safetensors" "preprocessor_config.json"

CUDA_VISIBLE_DEVICES="" HF_HOME=~/projects/llamajjy/llama.cpp/tmp python3 convert_hf_to_gguf.py tmp/Qwen3-VL-8B-Instruct --outtype f16 --use-temp-file --outfile models

CUDA_VISIBLE_DEVICES="" HF_HOME=~/projects/llamajjy/llama.cpp/tmp python3 convert_hf_to_gguf.py tmp/Qwen3-VL-8B-Instruct --outtype f16 --use-temp-file --outfile models --mmproj

build-vulkan/bin/llama-server -m models/Qwen3-VL-8B-Instruct-F16.gguf --mmproj models/mmproj-Qwen3-VL-8b-Instruct-F16.gguf --jinja --host 0.0.0.0 --port 8081 -ngl 999

I don't believe this issue is a result of the code changes.

@LETS-BEE (Contributor)

For some reason, this version's OCR capability is not as good as the previous LETS-BEE version; it noticeably misses characters and exhibits infinite repetition.

I think merging PR #16745 will likely restore the model's original performance.
However, I don't know why, but using nearest-neighbor interpolation instead of bilinear interpolation when resizing the position embeddings seems to yield better performance.
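
For context, a toy sketch of the two resampling modes being compared (not the clip.cpp implementation; the grid layout and names are assumptions). Resizing the learned position-embedding grid to the image's patch grid samples each channel either at the nearest source cell or as a bilinear blend of the four neighbours:

// Toy sketch, not the clip.cpp code: resample one channel of a [gh x gw]
// position-embedding grid at fractional source coordinates (x, y).
#include <algorithm>
#include <cmath>
#include <vector>

static float sample_nearest(const std::vector<float> & grid, int gw, int gh, float x, float y) {
    // snap to the closest source cell
    const int sx = std::clamp((int) std::lround(x), 0, gw - 1);
    const int sy = std::clamp((int) std::lround(y), 0, gh - 1);
    return grid[sy * gw + sx];
}

static float sample_bilinear(const std::vector<float> & grid, int gw, int gh, float x, float y) {
    // blend the four surrounding source cells by their distance to (x, y)
    const int   x0 = std::clamp((int) std::floor(x), 0, gw - 1);
    const int   y0 = std::clamp((int) std::floor(y), 0, gh - 1);
    const int   x1 = std::min(x0 + 1, gw - 1);
    const int   y1 = std::min(y0 + 1, gh - 1);
    const float fx = x - (float) x0;
    const float fy = y - (float) y0;
    const float top = grid[y0 * gw + x0] * (1.0f - fx) + grid[y0 * gw + x1] * fx;
    const float bot = grid[y1 * gw + x0] * (1.0f - fx) + grid[y1 * gw + x1] * fx;
    return top * (1.0f - fy) + bot * fy;
}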

@JJJYmmm JJJYmmm requested a review from 0cc4m as a code owner October 30, 2025 11:01

@JJJYmmm (Contributor Author) commented Oct 30, 2025

@CISC I've updated the corresponding file, but haven't tested it yet since I don't have a Vulkan environment at the moment.

@0cc4m (Collaborator) commented Oct 30, 2025

GLSL cannot automatically convert integers to bool, so you need the full condition; for example, if (p.is_imrope) { has to be if (p.is_imrope != 0) {

Comment on lines +4062 to +4064
self.is_deepstack_layers = [False] * int(self.hparams_vision["num_hidden_layers"] or 0)
for idx in self.hparams_vision.get("deepstack_visual_indexes", []):
    self.is_deepstack_layers[idx] = True

Collaborator (@ngxson, Oct 30, 2025):

(No action is needed, just a side note.)

The is_deepstack_layers metadata is no longer used in clip.cpp, as I want to keep the code simpler to maintain. We now use the same logic as MoE in llama.cpp: if the tensor is not present, it will be nullptr, and this triggers the code branch for deepstack layers.

But we will still keep this metadata in the GGUF for future use.
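
As an illustration of the pattern described above (a sketch only; the struct and tensor names are hypothetical, not the actual clip.cpp fields):

#include "ggml.h"

// Sketch of the nullptr-based detection: optional per-layer tensors are simply
// left as nullptr when they are absent from the GGUF, and their presence is
// what selects the deepstack code path -- no extra metadata lookup needed.
struct vision_layer_sketch {
    ggml_tensor * deepstack_fc1_w = nullptr; // hypothetical deepstack-only tensor
    // ... regular per-layer tensors ...
};

static bool is_deepstack_layer(const vision_layer_sketch & layer) {
    // same idea as the MoE detection in llama.cpp: missing tensor -> nullptr
    return layer.deepstack_fc1_w != nullptr;
}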

@JJJYmmm JJJYmmm requested a review from reeselevine as a code owner October 30, 2025 11:31
Co-authored-by: Sigbjørn Skjæret <[email protected]>

@pt13762104 (Contributor)

I see an error in the mmproj creation: ValueError: Can not map tensor 'visual.blocks.0.attn.qkv.bias'

@github-actions bot added the Vulkan (Issues specific to the Vulkan backend) label on Oct 30, 2025

@ngxson (Collaborator) commented Oct 30, 2025

As a reminder, you can also add support for other backends in follow-up PRs, to avoid pulling too many reviewers into one PR (preferably, one PR per backend).

@ngxson (Collaborator) commented Oct 30, 2025

I'm merging this in the next 30 minutes to an hour, as the CI for test-backend-ops has already passed. Thanks for the contribution, @JJJYmmm!

@JJJYmmm (Contributor Author) commented Oct 30, 2025

Thank you all for the detailed review! 🙏

@SharkWipf

I never watched a llama.cpp PR thread before, never realized how well-organized and dedicated you all are, just wanted to chime in to say: you all rock and your effort is appreciated.

@mpapili commented Oct 30, 2025

I never watched a llama.cpp PR thread before, never realized how well-organized and dedicated you all are, just wanted to chime in to say: you all rock and your effort is appreciated.

Ditto. I've been watching this PR like a hawk. Great contributors and great maintainers all around.

@ngxson ngxson merged commit d261223 into ggml-org:master Oct 30, 2025
71 of 73 checks passed

@RodriMora (Contributor)

I believe requirements.txt needs to be updated; the current transformers version does not support the Qwen3-VL architecture. This is not a problem for inference, but for quantizing it will not recognize the arch.


Labels

Apple Metal: https://en.wikipedia.org/wiki/Metal_(API)
examples
ggml: changes relating to the ggml tensor library for machine learning
Nvidia GPU: Issues specific to Nvidia GPUs
python: python script changes
SYCL: https://en.wikipedia.org/wiki/SYCL - GPU programming language
testing: Everything test related
Vulkan: Issues specific to the Vulkan backend

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature Request: support qwen3-vl series